Hands-on Exercise 03-2

ggplot2
dplyr
gganimate
plotly
Programming Animated Statistical Graphics with R
Published

January 28, 2024

Modified

January 30, 2025

4.1 Overview

When telling a visually-driven data story, animated graphics tend to attract the interest of the audience and make a deeper impression than static graphics. In this hands-on exercise, we will learn how to:

  1. Create animated data visualisations by using the gganimate and plotly R packages
  2. Reshape data by using the tidyverse family of packages
  3. Process, wrangle and transform data by using the dplyr package

4.1.1 Basic concepts of animation

When creating animations, the plot does not actually move. Instead, many individual plots are built and then stitched together as movie frames, just like an old-school flip book or cartoon. Each frame is a different plot when conveying motion, which is built using some relevant subset of the aggregate data. The subset drives the flow of the animation when stitched back together.
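The flip-book idea can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data frame `toy` and its values are made up, and base `plot()` stands in for the ggplot2 code used later in this exercise.

```r
# Toy data: three countries observed in three years (all values are made up).
toy <- data.frame(
  Country = rep(c("A", "B", "C"), times = 3),
  Year    = rep(c(2000L, 2010L, 2020L), each = 3),
  Old     = c(5, 10, 15, 6, 12, 17, 8, 14, 20),
  Young   = c(80, 60, 40, 75, 55, 38, 70, 50, 35)
)

# One subset of the data per frame: split() returns a list keyed by Year.
frames <- split(toy, toy$Year)

# Draw one static plot per subset, with fixed axes so the frames line up.
# Shown in sequence, these plots are the frames of the animation.
for (yr in names(frames)) {
  plot(frames[[yr]]$Old, frames[[yr]]$Young,
       xlim = range(toy$Old), ylim = range(toy$Young),
       xlab = "% Aged", ylab = "% Young",
       main = paste("Year:", yr))
}
```

Packages such as gganimate and plotly automate exactly this subsetting and stitching, and add smooth interpolation between frames.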

4.1.2 Terminology

Before we dive into the steps for creating an animated statistical graph, it’s important to understand some of the key concepts and terminology related to this type of visualization.

  1. Frame: In an animated line graph, each frame represents a different point in time or a different category. When the frame changes, the data points on the graph are updated to reflect the new data.

  2. Animation Attributes: The animation attributes are the settings that control how the animation behaves. For example, you can specify the duration of each frame, the easing function used to transition between frames, and whether to start the animation from the current frame or from the beginning.

Before you start making animated graphs, you should first ask yourself: Does it make sense to go through the effort? If you are conducting an exploratory data analysis, an animated graphic may not be worth the time investment. However, if you are giving a presentation, a few well-placed animated graphics can help an audience connect with your topic far better than their static counterparts.

4.2 Getting Started

4.2.1 Loading the R packages

First, write a code chunk to check, install and load the following R packages:

  • plotly, R library for plotting interactive statistical graphs.

  • gganimate, a ggplot2 extension for creating animated statistical graphs.

  • gifski converts video frames to GIF animations using pngquant’s fancy features for efficient cross-frame palettes and temporal dithering. It produces animated GIFs that use thousands of colors per frame.

  • gapminder: An excerpt of the data available at Gapminder.org. We just want to use its country_colors scheme.

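  • tidyverse, a family of modern R packages specially designed to support data science, analysis and communication tasks, including creating static statistical graphs.

  • readxl, for importing Excel worksheets into R.

  • ggthemes, which provides the theme_economist() and theme_wsj() themes used in the plots.

  • DT, for displaying data tables interactively.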

pacman::p_load(readxl, gifski, gapminder,
               plotly, gganimate, tidyverse, ggthemes, DT)
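For readers who prefer not to depend on pacman, the check-install-load step can be sketched in base R. The helper name `load_or_install()` is hypothetical (it is not part of pacman or any other package), and the sketch assumes a CRAN mirror is configured when an installation is actually needed.

```r
# Hypothetical helper: load a package, installing it first if it is missing.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)            # only reached when pkg is not installed
  }
  library(pkg, character.only = TRUE)
  # Report whether the package is now attached to the search path.
  invisible(paste0("package:", pkg) %in% search())
}

# Demonstrated on packages that ship with R, so nothing is installed here.
pkgs   <- c("stats", "utils")
loaded <- vapply(pkgs, load_or_install, logical(1))
```

In this exercise, `pkgs` would instead hold the package names passed to p_load() above.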

4.2.2 Importing and Examining the data

In this hands-on exercise, the Data worksheet from the GlobalPopulation Excel workbook will be used.

We first use read_xls() of the readxl package to import the worksheet:

globalpop_raw <- read_xls("data/GlobalPopulation.xls", sheet = "Data")

Next, we use str(), head() and summary() from base R, together with summarise_all() and n_distinct() from dplyr, to examine the data structure and variable types:

  • The dataset contains 6,204 observations with no missing values.

  • Country: The dataset contains 222 countries, stored as character data type.

  • Year: The data spans from 1996 to 2050 and is stored as double data type.

  • Young: Based on the data context, the “Young” variable represents the percentage of young people in the population, with values ranging from 15.5% to 109.2%. It is stored as double data type.

  • Old: Based on the data context, the “Old” variable represents the percentage of elderly people in the population, with values ranging from 1% to 77.1%. It is stored as double data type.

  • Population: The values range from 3.3 K to 1,807,878.6 K within the data period.

  • Continent: The dataset contains 6 continents, stored as character data type.

str(globalpop_raw)
tibble [6,204 × 6] (S3: tbl_df/tbl/data.frame)
 $ Country   : chr [1:6204] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Year      : num [1:6204] 1996 1998 2000 2002 2004 ...
 $ Young     : num [1:6204] 83.6 84.1 84.6 85.1 84.5 84.3 84.1 83.7 82.9 82.1 ...
 $ Old       : num [1:6204] 4.5 4.5 4.5 4.5 4.5 4.6 4.6 4.6 4.6 4.7 ...
 $ Population: num [1:6204] 21560 22913 23898 25268 28514 ...
 $ Continent : chr [1:6204] "Asia" "Asia" "Asia" "Asia" ...
head(globalpop_raw)
# A tibble: 6 × 6
  Country      Year Young   Old Population Continent
  <chr>       <dbl> <dbl> <dbl>      <dbl> <chr>    
1 Afghanistan  1996  83.6   4.5     21560. Asia     
2 Afghanistan  1998  84.1   4.5     22913. Asia     
3 Afghanistan  2000  84.6   4.5     23898. Asia     
4 Afghanistan  2002  85.1   4.5     25268. Asia     
5 Afghanistan  2004  84.5   4.5     28514. Asia     
6 Afghanistan  2006  84.3   4.6     31057  Asia     
globalpop_raw %>%
  summarise_all(~n_distinct(.))
# A tibble: 1 × 6
  Country  Year Young   Old Population Continent
    <int> <int> <int> <int>      <int>     <int>
1     222    28   819   589       5791         6
# check if there're any missing values
any(is.na(globalpop_raw))
[1] FALSE
summary(globalpop_raw)
   Country               Year          Young             Old       
 Length:6204        Min.   :1996   Min.   : 15.50   Min.   : 1.00  
 Class :character   1st Qu.:2010   1st Qu.: 25.70   1st Qu.: 6.90  
 Mode  :character   Median :2024   Median : 34.30   Median :12.80  
                    Mean   :2023   Mean   : 41.66   Mean   :17.93  
                    3rd Qu.:2038   3rd Qu.: 53.60   3rd Qu.:25.90  
                    Max.   :2050   Max.   :109.20   Max.   :77.10  
   Population         Continent        
 Min.   :      3.3   Length:6204       
 1st Qu.:    605.9   Class :character  
 Median :   5771.6   Mode  :character  
 Mean   :  34860.9                     
 3rd Qu.:  22711.0                     
 Max.   :1807878.6                     
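In current dplyr, summarise_all() is superseded by across(). The sketch below shows the same two checks (distinct values and missing values per column) written with across(). It runs on a small made-up tibble called `toy` rather than on globalpop_raw, so the numbers are illustrative only.

```r
library(dplyr)

# Made-up stand-in for globalpop_raw.
toy <- tibble(
  Country   = c("A", "A", "B", "B"),
  Year      = c(2000, 2010, 2000, 2010),
  Continent = c("Asia", "Asia", "Europe", "Europe")
)

# Number of distinct values in every column.
distinct_counts <- toy %>%
  summarise(across(everything(), n_distinct))

# Number of missing values in every column.
missing_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

distinct_counts
missing_counts
```

Counting missing values per column, rather than calling any(is.na()) once, also shows where the gaps are if there are any.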

4.2.3 Handling Data Issues

4.2.3.1 Data Type Issues

  1. Year: Since year is a whole number rather than a decimal, we should transform its data type from double <dbl> to integer <int>.
  2. Country and Continent: Since these two categorical variables will be analyzed further, we need to transform their data type from character <chr> to factor <fct>. In R, factors are used to handle categorical data and ordered variables.

Here, we use mutate_each_() of the dplyr package to convert all character columns into factors, and mutate() of the dplyr package to convert the values of the Year field into integers.

col <- c("Country","Continent")

globalpop_raw <- read_xls("data/GlobalPopulation.xls",sheet="Data") %>%
  mutate_each_(funs(factor(.)),col) %>%
  mutate(Year = as.integer(Year))
  
head(globalpop_raw)
# A tibble: 6 × 6
  Country      Year Young   Old Population Continent
  <fct>       <int> <dbl> <dbl>      <dbl> <fct>    
1 Afghanistan  1996  83.6   4.5     21560. Asia     
2 Afghanistan  1998  84.1   4.5     22913. Asia     
3 Afghanistan  2000  84.6   4.5     23898. Asia     
4 Afghanistan  2002  85.1   4.5     25268. Asia     
5 Afghanistan  2004  84.5   4.5     28514. Asia     
6 Afghanistan  2006  84.3   4.6     31057  Asia     

Unfortunately, mutate_each_() was deprecated in dplyr 0.7.0 and funs() was deprecated in dplyr 0.8.0. In view of this, we will rewrite the code by using mutate_at(), as shown in the code chunk below.

col <- c("Country","Continent")

globalpop_raw <- read_xls("data/GlobalPopulation.xls",sheet="Data") %>%
  mutate_at(col, as.factor) %>%
  mutate(Year = as.integer(Year))
  
head(globalpop_raw)
# A tibble: 6 × 6
  Country      Year Young   Old Population Continent
  <fct>       <int> <dbl> <dbl>      <dbl> <fct>    
1 Afghanistan  1996  83.6   4.5     21560. Asia     
2 Afghanistan  1998  84.1   4.5     22913. Asia     
3 Afghanistan  2000  84.6   4.5     23898. Asia     
4 Afghanistan  2002  85.1   4.5     25268. Asia     
5 Afghanistan  2004  84.5   4.5     28514. Asia     
6 Afghanistan  2006  84.3   4.6     31057  Asia     

Instead of mutate_at(), across() can be used to derive the same output.

col <- c("Country","Continent")

globalpop_raw <- read_xls("data/GlobalPopulation.xls",sheet="Data") %>%
  mutate(across(all_of(col), as.factor)) %>%
  mutate(Year = as.integer(Year))

head(globalpop_raw)
# A tibble: 6 × 6
  Country      Year Young   Old Population Continent
  <fct>       <int> <dbl> <dbl>      <dbl> <fct>    
1 Afghanistan  1996  83.6   4.5     21560. Asia     
2 Afghanistan  1998  84.1   4.5     22913. Asia     
3 Afghanistan  2000  84.6   4.5     23898. Asia     
4 Afghanistan  2002  85.1   4.5     25268. Asia     
5 Afghanistan  2004  84.5   4.5     28514. Asia     
6 Afghanistan  2006  84.3   4.6     31057  Asia     

4.2.3.2 Data Quality Issues

The summary statistics show that the maximum value of Young% is 109.2%, which points to inaccurate or incomplete data. This is problematic because Young% + Old% should be less than or equal to 100%. A sum above 100% would imply a negative Mid-aged%, which is impossible, so such records indicate inaccurate or missing data in this dataset.

summary(globalpop_raw)
        Country          Year          Young             Old       
 Afghanistan:  28   Min.   :1996   Min.   : 15.50   Min.   : 1.00  
 Albania    :  28   1st Qu.:2010   1st Qu.: 25.70   1st Qu.: 6.90  
 Algeria    :  28   Median :2024   Median : 34.30   Median :12.80  
 Andorra    :  28   Mean   :2023   Mean   : 41.66   Mean   :17.93  
 Angola     :  28   3rd Qu.:2038   3rd Qu.: 53.60   3rd Qu.:25.90  
 Anguilla   :  28   Max.   :2050   Max.   :109.20   Max.   :77.10  
 (Other)    :6036                                                  
   Population                Continent   
 Min.   :      3.3   Africa       :1568  
 1st Qu.:    605.9   Asia         :1454  
 Median :   5771.6   Europe       :1344  
 Mean   :  34860.9   North America: 976  
 3rd Qu.:  22711.0   Oceania      : 526  
 Max.   :1807878.6   South America: 336  
                                         

The table below lists the 63 observations with data accuracy issues. To maintain the dataset’s integrity, we should remove all records of the countries in which these observations occur.

dq_issues <- subset(globalpop_raw, Young > 100 | (Young + Old) > 100)

table <- DT::datatable(dq_issues, class= "display",
              caption = "Table 1: Observations with data quality issues") %>%
  formatStyle(
    columns = colnames(dq_issues), 
    fontSize = '12px', 
    fontFamily = 'Helvetica', 
    lineHeight = '1.2'
  )
table
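To see how the flagged observations are spread across countries, the same condition can be combined with count() from dplyr. The sketch below uses a small made-up tibble `toy` in place of globalpop_raw, and the column name `n_flagged` is chosen here for illustration.

```r
library(dplyr)

# Made-up stand-in for globalpop_raw.
toy <- tibble(
  Country = c("A", "A", "B", "B", "C", "C"),
  Young   = c(60, 105, 40, 45, 70, 72),
  Old     = c(10,   8, 20, 22, 35, 25)
)

# Keep only rows that violate the rule, then count them per country.
flagged <- toy %>%
  filter(Young > 100 | Young + Old > 100) %>%
  count(Country, name = "n_flagged")

flagged
```

A country with only one or two flagged rows is still removed in full by the approach used in this exercise, so a per-country count like this helps judge how much data that decision costs.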

After removing the problematic records, 5,953 observations remain, covering 213 distinct countries.

c_removed = unique(dq_issues$Country)
globalPop <- subset(globalpop_raw, !(Country %in% c_removed))
str(globalPop)
tibble [5,953 × 6] (S3: tbl_df/tbl/data.frame)
 $ Country   : Factor w/ 222 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Year      : int [1:5953] 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014 ...
 $ Young     : num [1:5953] 83.6 84.1 84.6 85.1 84.5 84.3 84.1 83.7 82.9 82.1 ...
 $ Old       : num [1:5953] 4.5 4.5 4.5 4.5 4.5 4.6 4.6 4.6 4.6 4.7 ...
 $ Population: num [1:5953] 21560 22913 23898 25268 28514 ...
 $ Continent : Factor w/ 6 levels "Africa","Asia",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(globalPop)
        Country          Year          Young            Old       
 Afghanistan:  28   Min.   :1996   Min.   :15.50   Min.   : 1.00  
 Albania    :  28   1st Qu.:2010   1st Qu.:25.50   1st Qu.: 7.10  
 Algeria    :  28   Median :2024   Median :33.40   Median :13.70  
 Andorra    :  28   Mean   :2023   Mean   :40.19   Mean   :18.39  
 Angola     :  28   3rd Qu.:2038   3rd Qu.:50.90   3rd Qu.:26.50  
 Anguilla   :  28   Max.   :2050   Max.   :94.80   Max.   :77.10  
 (Other)    :5785                                                 
   Population                Continent   
 Min.   :      3.3   Africa       :1372  
 1st Qu.:    597.8   Asia         :1399  
 Median :   5580.3   Europe       :1344  
 Mean   :  35028.8   North America: 976  
 3rd Qu.:  22093.1   Oceania      : 526  
 Max.   :1807878.6   South America: 336  
                                         
globalPop %>% summarise_all(~n_distinct(.))
# A tibble: 1 × 6
  Country  Year Young   Old Population Continent
    <int> <int> <int> <int>      <int>     <int>
1     213    28   756   589       5549         6
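Note that the str() output above still reports 222 levels for Country although only 213 countries remain: subset() keeps unused factor levels. If the empty levels get in the way (for example in legends or tables), droplevels() from base R removes them. A minimal sketch on made-up data:

```r
# Made-up stand-in for globalpop_raw, with Country stored as a factor.
toy <- data.frame(
  Country = factor(c("A", "A", "B", "C")),
  Young   = c(60, 105, 40, 45),
  Old     = c(10,   8, 20, 22)
)

# Remove every record of any country with at least one invalid row.
bad_countries <- unique(toy$Country[toy$Young + toy$Old > 100])
kept <- subset(toy, !(Country %in% bad_countries))

levels_before <- nlevels(kept$Country)  # "A" is still listed as an empty level
kept <- droplevels(kept)
levels_after <- nlevels(kept$Country)   # only levels that occur in the data

c(before = levels_before, after = levels_after)
```

Applied to this exercise, the equivalent step would be globalPop <- droplevels(globalPop).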

4.3 Animated Data Visualisation: gganimate methods

gganimate extends the grammar of graphics as implemented by ggplot2 to include the description of animation. It does this by providing a range of new grammar classes that can be added to the plot object in order to customise how it should change with time.

  • transition_*() defines how the data should be spread out and how it relates to itself across time.

  • view_*() defines how the positional scales should change along the animation.

  • shadow_*() defines how data from other points in time should be presented in the given point in time.

  • enter_*()/exit_*() defines how new data should appear and how old data should disappear during the course of the animation.

  • ease_aes() defines how different aesthetics should be eased during transitions.
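The sketch below shows how several of these grammar classes are added to an ordinary ggplot object. The data frame `toy` is made up, and the specific choices (shadow_wake(), enter_fade(), exit_fade(), linear easing) are only one possible combination. A view_*() function such as view_follow() could be added in the same way.

```r
library(ggplot2)
library(gganimate)

# Made-up data: two points moving over four years.
toy <- data.frame(
  id   = rep(c("A", "B"), each = 4),
  year = rep(2000:2003, times = 2),
  x    = c(1, 2, 3, 4, 4, 3, 2, 1),
  y    = c(1, 3, 2, 4, 2, 1, 4, 3)
)

anim <- ggplot(toy, aes(x = x, y = y, colour = id, group = id)) +
  geom_point(size = 4) +
  transition_time(year) +           # transition_*(): the data change with year
  shadow_wake(wake_length = 0.2) +  # shadow_*(): fading trail of earlier frames
  enter_fade() +                    # enter_*(): how new data appear
  exit_fade() +                     # exit_*(): how old data disappear
  ease_aes("linear")                # ease_aes(): easing applied to aesthetics

class(anim)
```

Printing `anim` renders the animation. Until then it is just a plot specification, so the layers can be rearranged or swapped freely.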

4.3.1 Building a static population bubble plot

In the code chunk below, the basic ggplot2 functions are used to create a static bubble plot.

ggplot(globalPop, aes(x = Old, y = Young,
                      size = Population,
                      colour = Country))+
  geom_point(alpha=0.7, show.legend = FALSE)+
  scale_colour_manual(values = country_colors)+
  scale_size(range= c(2,12))+
  labs(title = 'Global Population Change from 1996 to 2050',
       # no 'Year:{frame_time}' subtitle here: the placeholder is only filled in by gganimate
       x = '% Aged',
       y = '% Young')+
  theme_economist(base_size = 8)

4.3.2 Building the animated bubble plot

In the code chunk below,

  • transition_time() of gganimate is used to create transitions through distinct states in time (i.e. Year). frame_time is a special placeholder in gganimate that makes the title dynamic.

  • ease_aes() is used to control easing of aesthetics. The default is linear. Other methods are: quadratic, cubic, quartic, quintic, sine, circular, exponential, elastic, back, and bounce.

    ggplot(globalPop, aes(x = Old, y= Young,
                          size = Population, colour = Country))+
      geom_point(alpha = 0.7, show.legend = FALSE)+
      scale_colour_manual(values = country_colors)+
      scale_size(range = c(2,12))+   # control point size to be 2~12
      labs(title = 'Global Population Change from 1996 to 2050',
           subtitle = 'Year:{frame_time}',   # {frame_time} is a special placeholder (dynamic title) in gganimate
           x = '% Aged',
           y = '% Young')+
      transition_time(Year)+
      ease_aes('cubic-in-out')+
      theme_economist(base_size = 8)
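Printing a gganimate object renders it with default settings. To control the number of frames and the frame rate, or to write the result to disk, animate() and anim_save() can be used. The sketch below assumes the gifski package is installed. The toy data, the frame settings and the output file name are illustrative choices, not values taken from this exercise.

```r
library(ggplot2)
library(gganimate)

# Made-up data: two points moving over four years.
toy <- data.frame(
  id   = rep(c("A", "B"), each = 4),
  year = rep(2000:2003, times = 2),
  x    = c(1, 2, 3, 4, 4, 3, 2, 1),
  y    = c(1, 3, 2, 4, 2, 1, 4, 3)
)

p <- ggplot(toy, aes(x = x, y = y, colour = id, group = id)) +
  geom_point(size = 4) +
  transition_time(year)

# Render 20 frames at 10 frames per second (a 2-second GIF).
anim <- animate(p, nframes = 20, fps = 10,
                width = 400, height = 300,
                renderer = gifski_renderer())

# Write the rendered animation to a temporary file.
out_file <- file.path(tempdir(), "toy_animation.gif")
anim_save(out_file, animation = anim)
```

Fewer frames render faster but look choppier, so it can help to start small while designing the plot and raise nframes only for the final output.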

4.4 Animated Data Visualisation: plotly

In the plotly R package, both ggplotly() and plot_ly() support key frame animations through the frame argument/aesthetic. They also support an ids argument/aesthetic to ensure smooth transitions between objects with the same id (which helps facilitate object constancy).
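A minimal sketch of the two arguments, using a made-up data frame `toy`: frame decides which rows belong to which key frame, and ids tells plotly which marker in one frame corresponds to which marker in the next.

```r
library(plotly)

# Made-up data: two objects observed at three points in time.
toy <- data.frame(
  id   = rep(c("A", "B"), each = 3),
  year = rep(c(2000, 2010, 2020), times = 2),
  x    = c(1, 2, 3, 3, 2, 1),
  y    = c(1, 2, 1, 2, 3, 2)
)

p <- plot_ly(toy, x = ~x, y = ~y,
             frame = ~year,   # one key frame per distinct year
             ids   = ~id,     # keeps "A" and "B" matched across frames
             type = "scatter", mode = "markers")

p
```

Without ids, markers are matched across frames by their order in the data, which can produce misleading transitions when rows appear, disappear or are sorted differently between frames.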

4.4.1 Building an animated bubble plot: ggplotly() method

In this sub-section, we will create an animated bubble plot by using the ggplotly() method.

Things to learn from the code chunk
  • Appropriate ggplot2 functions are used to create a static bubble plot. The output is then saved as an R object called gg.

  • ggplotly() is then used to convert the R graphic object into an animated svg object.

gg <- ggplot(globalPop, aes(x = Old, y = Young,
                            size = Population, colour = Country))+
  geom_point(aes(size = Population,
                 frame = Year),
             alpha = 0.7,
             show.legend = FALSE) +
  scale_colour_manual(values = country_colors)+
  # control the size of points from 2 to 12
  scale_size(range = c(2,12))+
  labs(title = 'Global Population Change from 1996 to 2050',
       x = '% Aged', 
       y = '% Young')+
  theme_wsj(base_size = 8) + 
  theme(axis.title.x = element_text(size = 12, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"))

ggplotly(gg)

4.4.1.1 Improvements Needed

  1. Legend: Notice that although show.legend = FALSE argument was used, the legend still appears on the plot. To overcome this problem, theme(legend.position='none') should be used as shown in the plot and code chunk below.
  2. Color: Although the “country_colors” palette from gapminder provides colors for 142 countries, our dataset contains 213 countries, causing many data points to appear in grey. To improve visual distinction, we should color code the data by “Continent” rather than “Country”.
  3. Tooltips: To improve the readability of the plot, the tooltips are customized with detailed information by mapping a text aesthetic inside aes() and passing tooltip = "text" to ggplotly().

gg <- ggplot(globalPop, aes(x = Old, y = Young,
                            size = Population, colour = Continent,
                            text = paste("Year:",Year, # customize the content in tooltips
                                         "<br>Continent:",Continent,
                                         "<br>Country:", Country,
                                         "<br>Population:", scales::comma(Population), "K",
                                         "<br>Old:",round(Old,2),"%",
                                         "<br>Young:",round(Young,2),"%")))+ 
  geom_point(aes(frame = Year),alpha = 0.7) +
  scale_size(range = c(2,12))+
  labs(title = 'Global Population Change from 1996 to 2050',
       x = '% Aged', 
       y = '% Young')+
  theme_wsj(base_size = 8)+ scale_color_wsj()+
  theme(axis.title.x = element_text(size = 12, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"),
        legend.position = 'none') # remove legend


ggplotly(gg, tooltip = "text")
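The colour coverage problem described above can be checked directly by comparing country names against the names of gapminder's country_colors vector. In the sketch below, the vector `countries` is a short made-up sample. In this exercise it would be the countries present in globalPop.

```r
library(gapminder)

# Illustrative sample; in the exercise this would be unique(as.character(globalPop$Country)).
countries <- c("Afghanistan", "Andorra", "Anguilla")

# country_colors is a named character vector: names are countries, values are colours.
n_colours <- length(country_colors)

# Countries with no entry in the palette (these are the ones drawn in grey).
no_colour <- setdiff(countries, names(country_colors))

n_colours
no_colour
```

Any country returned in `no_colour` has no assigned colour when scale_colour_manual(values = country_colors) is used, which is why colouring by Continent is the more robust choice here.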

4.4.2 Building an animated bubble plot: plot_ly() method

In this sub-section, you will learn how to create an animated bubble plot by using the plot_ly() method.

To customise the layout of a plot_ly() object, we need to use layout().

bp <- globalPop %>%
  plot_ly(x = ~Old, y= ~Young,
          size = ~Population, color = ~Continent,
          sizes = c(2,100),
          frame = ~Year, text = ~Country,
          hoverinfo = "text",
          type = "scatter", mode = "markers") %>%
  layout(showlegend = FALSE,
         # plotly fonts have no `face` attribute; bold is set with <b> tags in the text
         title = list(text = "<b>Global Population Change from 1996 to 2050</b>",
                      font = list(size = 15, family = "Georgia")),
         # axis `titlefont` is deprecated; the font now sits inside `title`
         xaxis = list(title = list(text = "% Aged",
                                   font = list(size = 12, family = "Georgia"))),
         yaxis = list(title = list(text = "% Young",
                                   font = list(size = 12, family = "Georgia"))),
         plot_bgcolor = "#f3f1e9",
         paper_bgcolor = "#f3f1e9")

bp
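The animation attributes introduced in section 4.1.2 (frame duration, easing, and so on) can be set on a plotly object with animation_opts(), while animation_slider() and animation_button() adjust the controls. The sketch below uses a made-up data frame `toy`, and the durations, easing and positions are illustrative values rather than recommendations.

```r
library(plotly)

# Made-up data: two objects observed at three points in time.
toy <- data.frame(
  id   = rep(c("A", "B"), each = 3),
  year = rep(c(2000, 2010, 2020), times = 2),
  x    = c(1, 2, 3, 3, 2, 1),
  y    = c(1, 2, 1, 2, 3, 2)
)

p <- plot_ly(toy, x = ~x, y = ~y, frame = ~year, ids = ~id,
             type = "scatter", mode = "markers") %>%
  animation_opts(frame = 1000,             # milliseconds each frame is shown
                 transition = 500,         # milliseconds spent moving between frames
                 easing = "cubic-in-out",  # easing function for the transition
                 redraw = FALSE) %>%
  animation_slider(currentvalue = list(prefix = "Year: ")) %>%
  animation_button(x = 1, xanchor = "right", y = 0, yanchor = "bottom")

p
```

The same three functions can be piped after the bp object above, or after ggplotly(gg), since both produce plotly objects.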

4.5 Reference